# Packages used by the code shown in this report (see session info)
library(tidyverse)
library(zoo)
library(knitr)
library(furrr)

# Import of data sets which have been exported at the end of data collection
air_weather_df <- read_csv2(file = "Daten/DataCollection/air_weather_df.csv") %>%
  select(-...1)
traffic_df <- read_csv2(file = "Daten/DataCollection/traffic_df.csv") %>%
  select(-...1)
traffic_detectors <- read_csv2(file = "Daten/DataCollection/traffic_detectors.csv") %>%
  select(-...1)
airquality_stations <- read_csv2(file = "Daten/DataCollection/airquality_stations.csv") %>%
  select(-...1)

# Extracting stations and groups from airquality_stations
airquality_station_groups <- airquality_stations %>%
  select(name, stationgroups) %>%
  distinct()

# At first sight, a lot of air quality monitoring data appears to be
# missing before the beginning of 2017. This data is therefore excluded
# from the analysis.
# date is a date-time column, so it is compared against a date-time
# (UTC assumed, which is the readr default) rather than a Date
air_weather_df <- air_weather_df %>%
  filter(date >= as.POSIXct("2017-01-01", tz = "UTC"))

# Grouping air quality monitoring data by station
airweather_by_station <- tibble(Station = airquality_station_groups %>% pull(name)) %>%
  mutate(messwerte = map(Station, function(x) air_weather_df %>% filter(Station == x)))

Exploratory data analysis
Exploratory data analysis involves examining the data sets imported as part of the data collection process. The first step is to understand the datasets and their variables, gain basic insights, and perform tasks such as computing CAQI-Index values. Another step is to investigate missing observations and how to deal with them. Finally, the actual analysis begins with examining distributions by calculating summary statistics and creating visualizations. Temporal trends and anomalies as well as outliers can be identified. Correlations, especially linear relationships between pollutant concentrations and weather or traffic data, are then examined.
Loading data sets from data collection
According to the data available to us, there are 20 air quality monitoring stations in Berlin, evenly distributed throughout the city and its immediate surroundings. These can be divided into 3 categories: “Suburb”, “Background” and “Traffic”. “Suburb” stations are located in the suburbs of Berlin, some of them in forests. “Background” includes stations that are located in the city, but collect typical measurements for residential areas. “Traffic” stations are located in the immediate vicinity of a major road, and their readings are likely to be strongly influenced by traffic.
Overview of Datasets
The following data sets contain the essential data used in the analysis in this work: pollutant concentration data, weather data and traffic data. All data sets contain records ranging from January 2017 to April/May 2023.
Pollutant Measurement and Weather Data
The first imported dataset contains weather and air pollution measurements recorded at various stations across Berlin. The recorded parameters include the date and time of the respective measurement, a station name, a station category (‘traffic’, ‘suburb’ or ‘background’), two particulate matter fractions (PM2.5 and PM10), and ozone (O3) and nitrogen dioxide (NO2) concentrations. Weather conditions are described by dew point temperature at 2 metres, amount of precipitation, relative humidity at 2 metres, surface air pressure, temperature at 2 metres, wind direction at 100 metres, wind speed at 100 metres and duration of sunshine.
Rows: 1,117,140
Columns: 15
$ date <dttm> 2017-01-01 01:00:00, 2017-01-01 02:00:00, 2017-01…
$ Station <chr> "010 Wedding", "010 Wedding", "010 Wedding", "010 …
$ stationgroups <chr> "background", "background", "background", "backgro…
$ pm25 <dbl> 175, 99, 63, 29, 20, 20, 26, 27, 29, 23, 19, 17, 1…
$ pm10 <dbl> 185, 104, 67, 31, 22, 23, 28, 30, 32, 25, 21, 20, …
$ O3 <dbl> 8, 19, 22, 32, 34, 30, 26, 26, 25, 34, 38, 39, 41,…
$ NO2 <dbl> 48, 37, NA, NA, 21, 24, 24, 25, 26, 22, 22, 21, 21…
$ dewpoint_2m <dbl> -1.3, -1.6, -1.9, -2.3, -2.7, -3.1, -3.5, -4.0, -2…
$ precipitation <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ relativehumidity_2m <dbl> 81, 79, 77, 76, 74, 72, 71, 69, 77, 75, 73, 69, 68…
$ surface_pressure <dbl> 1019.2, 1018.7, 1017.9, 1017.4, 1016.6, 1015.7, 10…
$ temperature_2m <dbl> 1.6, 1.7, 1.7, 1.6, 1.5, 1.4, 1.2, 1.1, 0.8, 0.6, …
$ winddirection_100m <dbl> 235, 232, 233, 236, 236, 235, 236, 236, 231, 229, …
$ windspeed_100m <dbl> 7.24, 7.62, 7.92, 7.85, 7.85, 7.96, 8.07, 8.13, 8.…
$ duration_sunlight <dbl> 466, 466, 466, 466, 466, 466, 466, 466, 466, 466, …
Traffic Data
The imported traffic data set provides measurements from various traffic measuring stations across Berlin. Each entry includes the station’s identifier, the date and time of the measurement, and a quality indicator for the collected data.
The traffic data include the hourly vehicle count and the average speed in kilometers per hour, both overall and broken down by cars and trucks. They thus provide a detailed view of the city’s traffic flow.
Rows: 12,465,926
Columns: 9
$ cs_shortname <chr> "TE001", "TE001", "TE001", "TE001", "TE001", "TE001", "TE…
$ date <dttm> 2017-01-01 00:00:00, 2017-01-01 01:00:00, 2017-01-01 02:…
$ quality <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
$ q_kfz_mq_hr <dbl> 87, 319, 271, 180, 111, 77, 98, 108, 166, 317, 808, 1287,…
$ v_kfz_mq_hr <dbl> 77, 74, 80, 83, 88, 86, 78, 73, 68, 72, 63, 65, 88, 78, 7…
$ q_pkw_mq_hr <dbl> 86, 311, 268, 174, 108, 73, 93, 105, 147, 306, 781, 1269,…
$ v_pkw_mq_hr <dbl> 78, 76, 81, 84, 89, 86, 78, 73, 68, 72, 62, 65, 89, 78, 7…
$ q_lkw_mq_hr <dbl> 1, 8, 3, 6, 3, 4, 5, 3, 19, 11, 27, 18, 28, 22, 22, 17, 2…
$ v_lkw_mq_hr <dbl> 0, 11, 3, 56, 30, 83, 75, 83, 74, 73, 77, 76, 52, 63, 83,…
Observation and handling of missing values
When looking for missing pollutant values, it is noticeable that some stations provide very reliable data, while others provide very unreliable data or none at all. “Traffic” type stations hardly record any data for O3, although this should not be much of a problem, as this pollutant is not relevant for the calculation of the CAQI index at “traffic” stations (see section 3.3). Some “suburb” stations are also characterised by missing data for PM2.5 and PM10. For some other stations, up to almost 70% of the values are missing for the period January 2017 to May 2023. The decision to include stations in further analysis was based on the proportion of missing data points for each pollutant: if more than 25% of the observations are missing for more than two of the four pollutants, the station is excluded from our dataset. The following table shows the percentage of missing values per pollutant and the resulting decision for each station.
# Function to compute the relative amount of missing values (NA, in percent)
# for each column of a station's measurements
calc_na_amount <- function(st, df) {
  df %>%
    map(., ~((sum(is.na(.))/length(.))*100)) %>%
    as_tibble() %>%
    mutate(Station = st)
}

# Function to classify stations regarding the amount of missing data:
# a station is eliminated if more than two pollutants are flagged as missing
classify_eliminate <- function(vec) {
  sum(vec) > 2
}

# Computing the relative amount of NA's for each pollutant and station.
# Providing information whether the respective station will be excluded or not.
rel_na_amount_by_station <- airweather_by_station %>%
  mutate(na_amount = map2(Station, messwerte, calc_na_amount)) %>%
  pull(na_amount) %>%
  bind_rows() %>%
  select(Station,
         pm25,
         pm10,
         O3,
         NO2) %>%
  # A pollutant counts as missing if more than 25% of its values are NA
  mutate(missing_pm25 = pm25 > 25,
         missing_pm10 = pm10 > 25,
         missing_O3 = O3 > 25,
         missing_NO2 = NO2 > 25) %>%
  mutate(missing_agg = pmap(list(missing_pm25, missing_pm10, missing_O3, missing_NO2), c)) %>%
  mutate(eliminate = map_lgl(missing_agg, classify_eliminate))

rel_na_amount_by_station %>%
  select(Station, pm25, pm10, O3, NO2, eliminate) %>%
  kable(caption = "Relative amount of missing values for air quality stations and decision whether to eliminate station or not.")

# Removing functions that are no longer needed
remove(calc_na_amount,
       classify_eliminate)
# Stations providing nearly complete data
stations_values_complete <- rel_na_amount_by_station %>%
  filter(!missing_pm25,
         !missing_pm10,
         !missing_O3,
         !missing_NO2,
         !eliminate) %>%
  pull(Station)

# Stations which do not monitor O3 values
stations_O3_missing <- rel_na_amount_by_station %>%
  filter(missing_O3,
         !eliminate) %>%
  pull(Station)

# Stations which do not monitor PM2.5/PM10 values
stations_PM_missing <- rel_na_amount_by_station %>%
  filter(missing_pm25,
         missing_pm10,
         !eliminate) %>%
  pull(Station)

# Stations, their monitoring values as well as their missing values and type
airweather_by_used_stations <- tibble(Station = stations_values_complete, missing_values = NA) %>%
  rbind(tibble(Station = stations_O3_missing, missing_values = "O3")) %>%
  rbind(tibble(Station = stations_PM_missing, missing_values = "PM2.5/PM10")) %>%
  inner_join(airweather_by_station, by = "Station") %>%
  left_join(airquality_station_groups, by = c("Station" = "name"))

# Removing data that is no longer needed
remove(stations_values_complete,
       stations_O3_missing,
       stations_PM_missing)

Table: Relative amount of missing values (in %) for air quality stations and decision whether to eliminate the station or not.

| Station | pm25 | pm10 | O3 | NO2 | eliminate |
|---|---|---|---|---|---|
| 010 Wedding | 0.3437349 | 0.3437349 | 0.4887481 | 0.7053726 | FALSE |
| 018 Schöneberg | 100.0000000 | 100.0000000 | 100.0000000 | 0.3491058 | TRUE |
| 027 Marienfelde | 100.0000000 | 100.0000000 | 0.5048606 | 0.4976995 | FALSE |
| 032 Grunewald | 2.5368351 | 2.5332546 | 1.1529441 | 0.9524321 | FALSE |
| 042 Neukölln | 0.6337612 | 0.6337612 | 0.4457812 | 0.6606155 | FALSE |
| 077 Buch | 0.8002578 | 0.7966772 | 0.7984675 | 1.0688007 | FALSE |
| 085 Friedrichshagen | 0.8342732 | 0.8324829 | 0.8002578 | 0.7716132 | FALSE |
| 115 Hardenbergplatz | 100.0000000 | 100.0000000 | 62.3413359 | 0.7716132 | TRUE |
| 117 Schildhornstraße | 0.6516641 | 0.6534544 | 100.0000000 | 0.5675206 | FALSE |
| 124 Mariendorfer Damm | 0.3974435 | 0.3992338 | 100.0000000 | 0.2846555 | FALSE |
| 143 Silbersteinstraße | 0.2184149 | 0.2184149 | 100.0000000 | 0.4153463 | FALSE |
| 145 Frohnau | 100.0000000 | 100.0000000 | 0.6212292 | 0.6391321 | FALSE |
| 171 Mitte | 5.0915731 | 1.7168842 | 100.0000000 | 0.7393881 | FALSE |
| 174 Frankfurter Allee | 0.8020481 | 0.7948869 | 31.5555794 | 0.6427126 | FALSE |
| 220 Karl-Marx-Straße | 29.5039118 | 29.5021215 | 100.0000000 | 29.7420198 | TRUE |
| 282 Karlshorst | 100.0000000 | 100.0000000 | 100.0000000 | 0.3777503 | TRUE |
| 088 Messwagen Leipziger Str. | 52.7937412 | 53.4257121 | 53.5134361 | 53.5044847 | TRUE |
| 014 Sondermessstation | 100.0000000 | 100.0000000 | 100.0000000 | 56.1147215 | TRUE |
| 190 Leipziger Straße | 50.8906672 | 50.8906672 | 100.0000000 | 50.8942478 | TRUE |
| 221 Karl-Marx-Straße | 69.6636053 | 69.6636053 | 100.0000000 | 70.9526111 | TRUE |
Handling missing pollutant data is not straightforward. Simply deleting all observations with missing values can result in a significant loss of data. Although for some pollutants there are longer periods without any measurements, the datasets also contain very short gaps of a few hours in which the data probably could not be collected correctly. Estimating data over a longer period is problematic because measurements may vary irregularly over many consecutive hours or days. In such a case, estimation with e.g. linear methods would run the risk of producing values that later do more harm than good as training data. Over very short periods, however, large fluctuations are much less likely, so estimating the missing data is reasonable. Therefore, runs of up to 4 consecutive missing observations are filled with linearly interpolated values. In this way, the proportion of missing observations can be reduced to a certain extent without running too great a risk of neglecting large fluctuations in the measured values.
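The effect of such a gap limit can be illustrated on a toy vector. The following is only a minimal sketch, not part of the original analysis: the series `x` is made up, and it assumes the `zoo` package, whose `na.approx()` accepts a `maxgap` argument.

```r
library(zoo)

# Made-up series: one gap of 2 values and one gap of 5 values
x <- c(10, NA, NA, 16, NA, NA, NA, NA, NA, 28)

# Linear interpolation, restricted to runs of at most 4 consecutive NAs;
# na.rm = FALSE keeps the result the same length as the input
filled <- na.approx(x, maxgap = 4, na.rm = FALSE)

print(filled)
```

In this sketch the short gap is filled along the straight line between 10 and 16, while the run of five missing values exceeds the limit and stays `NA`.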
# Function to approximate values by interpolation for given columns in a dataframe
estimate_by_interpolation <- function(df, columns, maximum_gap) {
  for (col in columns) {
    # if a column has no values at all, skip it and continue with the next column
    if (all(is.na(df[[col]]))) {
      next
    }
    na_start <- min(which(!is.na(df[[col]]))) # index of first non-NA value
    na_end <- max(which(!is.na(df[[col]]))) # index of last non-NA value
    # approximate values between first and last non-NA value
    df[[col]][na_start:na_end] <- na.approx(
      df[[col]][na_start:na_end], maxgap = maximum_gap
    )
  }
  return(df)
}

# Replace missing values by interpolation for all pollutants
airweather_by_used_stations <- airweather_by_used_stations %>%
  mutate(messwerte = map(.x = messwerte, estimate_by_interpolation,
                         c("pm25", "pm10", "O3", "NO2"),
                         4))

To examine missing traffic data, the traffic data set was first restricted considerably. For our analysis, only traffic sensors located in the immediate vicinity of the “traffic” type air quality stations in use were selected. These were identified through a detailed examination. Between 2 and 3 traffic sensors were identified at each of the 4 air quality stations. In total, data from 9 traffic sensors are considered.
The traffic data set does contain some anomalies, such as records showing a speed of -1 km/h, which suggests missing or anomalous data. It was found that a speed reading of -1 km/h was in most cases due to the fact that no vehicles were recorded at a station at a particular time. When looking more closely at the speed indications with -1 km/h, hardly any observations could be found where the number of recorded vehicles was greater than 0. Therefore, it was decided that further consideration is not necessary, as the speed indication is not taken into account in the further analysis.
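A check of this kind can be sketched as follows. The rows are made up for illustration; only the column names `cs_shortname`, `q_kfz_mq_hr` and `v_kfz_mq_hr` are taken from the traffic data set shown above, and the helper column `vehicles_recorded` is a hypothetical name.

```r
library(dplyr)

# Made-up rows using the column names of traffic_df
toy_traffic <- data.frame(
  cs_shortname = c("TE001", "TE001", "TE001", "TE002", "TE002"),
  q_kfz_mq_hr  = c(0, 120, 0, 35, 0),
  v_kfz_mq_hr  = c(-1, 48, -1, -1, -1)
)

# Among the -1 km/h readings, count how many have at least one recorded vehicle
speed_check <- toy_traffic %>%
  filter(v_kfz_mq_hr == -1) %>%
  count(vehicles_recorded = q_kfz_mq_hr > 0)

print(speed_check)
```

If nearly all -1 km/h readings fall into the `vehicles_recorded == FALSE` group, the value acts as a placeholder for hours without traffic rather than as a measurement error.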
Calculation of CAQI-Index
The calculation of the CAQI-Index is based on the theory explained in the associated work. The pollutant concentrations are combined into a single index that can be easily understood and compared. First, a sub-index is calculated for each individual pollutant; then a combined index is computed. The combined index is the maximum of the individual sub-index values, reflecting the pollutant with the highest concentration relative to its own threshold. Once the index is calculated, it is classified into one of several qualitative categories (‘very low’, ‘low’, ‘medium’, ‘high’, ‘very high’), which provide a more intuitive understanding of the air quality.
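As a small worked example of these two steps, the sketch below computes the sub-indices for two hypothetical hourly readings and combines them. The band limits are the ones used in the code of this report (NO2 50 to 99 and PM10 25 to 49, both mapped to the index range 25 to 49); the function name `sub_index` and the readings are assumptions made for illustration.

```r
# Linear interpolation of a concentration C within one breakpoint band
sub_index <- function(C, C_low, C_high, I_low, I_high) {
  round((I_high - I_low) / (C_high - C_low) * (C - C_low) + I_low)
}

# Hypothetical hourly readings in micrograms per cubic metre
no2  <- 75   # lies in the 50-99 band, index range 25-49
pm10 <- 30   # lies in the 25-49 band, index range 25-49

no2_index  <- sub_index(no2, 50, 99, 25, 49)   # 24/49 * 25 + 25 = 37.2 -> 37
pm10_index <- sub_index(pm10, 25, 49, 25, 49)  # 24/24 * 5 + 25 = 30

# The combined index is the largest sub-index
caqi <- max(no2_index, pm10_index)

print(c(NO2 = no2_index, PM10 = pm10_index, CAQI = caqi))
```

Here NO2 determines the combined value of 37, which falls into the ‘low’ category (25 to 49).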
# Compute the CAQI value for a single pollutant by linear interpolation
# between the limits of the concentration band (C_low, C_high)
# and the limits of the corresponding index band (I_low, I_high)
calculcate_single_aqi <- function(C, C_low, C_high, I_low, I_high) {
  round(((I_high - I_low) / (C_high - C_low)) * (C - C_low) + I_low, 0)
}

# Compute the combined CAQI value
compute_aqi <- function(NO2 = 0, PM10 = 0, O3 = 0, PM25 = 0) {
  if (!is.numeric(NO2)) {
    NO2 <- 0
  }
  if (!is.numeric(PM10)) {
    PM10 <- 0
  }
  if (!is.numeric(O3)) {
    O3 <- 0
  }
  if (!is.numeric(PM25)) {
    PM25 <- 0
  }
  # The band limits passed to calculcate_single_aqi() must match
  # the band tested in between()
  NO2_index <- case_when(
    between(NO2, 0, 49) ~ calculcate_single_aqi(NO2, 0, 49, 0, 24),
    between(NO2, 50, 99) ~ calculcate_single_aqi(NO2, 50, 99, 25, 49),
    between(NO2, 100, 199) ~ calculcate_single_aqi(NO2, 100, 199, 50, 74),
    between(NO2, 200, 400) ~ calculcate_single_aqi(NO2, 200, 400, 75, 100),
    NO2 > 400 ~ 101
  )
  PM10_index <- case_when(
    between(PM10, 0, 24) ~ calculcate_single_aqi(PM10, 0, 24, 0, 24),
    between(PM10, 25, 49) ~ calculcate_single_aqi(PM10, 25, 49, 25, 49),
    between(PM10, 50, 89) ~ calculcate_single_aqi(PM10, 50, 89, 50, 74),
    between(PM10, 90, 180) ~ calculcate_single_aqi(PM10, 90, 180, 75, 100),
    PM10 > 180 ~ 101
  )
  O3_index <- case_when(
    between(O3, 0, 59) ~ calculcate_single_aqi(O3, 0, 59, 0, 24),
    between(O3, 60, 119) ~ calculcate_single_aqi(O3, 60, 119, 25, 49),
    between(O3, 120, 179) ~ calculcate_single_aqi(O3, 120, 179, 50, 74),
    between(O3, 180, 240) ~ calculcate_single_aqi(O3, 180, 240, 75, 100),
    O3 > 240 ~ 101
  )
  PM25_index <- case_when(
    between(PM25, 0, 14) ~ calculcate_single_aqi(PM25, 0, 14, 0, 24),
    between(PM25, 15, 29) ~ calculcate_single_aqi(PM25, 15, 29, 25, 49),
    between(PM25, 30, 54) ~ calculcate_single_aqi(PM25, 30, 54, 50, 74),
    between(PM25, 55, 110) ~ calculcate_single_aqi(PM25, 55, 110, 75, 100),
    PM25 > 110 ~ 101
  )
  # The combined index is the largest of the four sub-indices
  return(max(c(NO2_index, PM10_index, O3_index, PM25_index)))
}

# Function to compute the qualitative name of respective CAQI-Index values
get_qualitative_name <- function(caqi_index) {
  case_when(
    caqi_index %in% c(0:24) ~ "very low",
    caqi_index %in% c(25:49) ~ "low",
    caqi_index %in% c(50:74) ~ "medium",
    caqi_index %in% c(75:100) ~ "high",
    caqi_index > 100 ~ "very high"
  )
}

# Compute all CAQI-Index values and qualitative names
plan(multisession, workers = 4)
air_weather_df <- airweather_by_used_stations %>%
  pull(messwerte) %>%
  bind_rows() %>%
  mutate(
    # Missing pollutant values are passed as 0 and thus do not raise the index
    caqi_index = unlist(future_pmap(
      .l = list(
        if_else(is.na(NO2), 0, NO2),
        if_else(is.na(pm10), 0, pm10),
        if_else(is.na(O3), 0, O3),
        if_else(is.na(pm25), 0, pm25)
      ),
      compute_aqi
    )),
    caqi_type = get_qualitative_name(caqi_index),
    .after = NO2
  )

# Grouping air quality monitoring data by station
airweather_by_used_stations <- airweather_by_used_stations %>%
  mutate(messwerte = map(Station, function(x) air_weather_df %>% filter(Station == x)))

Temporal Trends in Pollutants and CAQI index values
To get a concrete insight into the air quality measurement data, the first step is to look at their distributions, with particular attention to the mean and median values. For NO2, the mean value across all stations considered is 22.1 μg/m3. PM10 follows with a mean of 19.9 μg/m3. O3 has the highest concentration with an average of 49.0 μg/m3, and PM2.5 the lowest with an average of 13.2 μg/m3. For PM10, PM2.5 and O3, the mean lies roughly 1 to 3 μg/m3 above the median. For NO2 the deviation is somewhat larger at about 5 μg/m3. It is particularly noticeable that all pollutants show extreme outliers. PM10 stands out in particular with a maximum value of 1501 μg/m3. The CAQI index has a mean of 26.9 and a median of 25.0. There are some “outliers” with a value of 101.0, which is the maximum value that the CAQI index can take in this calculation.
pm25 pm10 O3 NO2
Min. : 1.00 Min. : 1.00 Min. : 0.00 Min. : 0.0
1st Qu.: 7.00 1st Qu.: 11.00 1st Qu.: 26.00 1st Qu.: 8.0
Median : 11.00 Median : 17.00 Median : 48.00 Median : 17.0
Mean : 13.21 Mean : 19.86 Mean : 48.97 Mean : 22.1
3rd Qu.: 17.00 3rd Qu.: 25.00 3rd Qu.: 68.00 3rd Qu.: 30.0
Max. :587.00 Max. :1501.00 Max. :208.00 Max. :220.0
NA's :117979 NA's :116080 NA's :242942 NA's :3406
caqi_index
Min. : 0.00
1st Qu.: 18.00
Median : 25.00
Mean : 26.93
3rd Qu.: 34.00
Max. :101.00
NA's :31
Looking at each type of station separately, the influence of the location and position of the station on the measured values becomes clearer. For NO2, PM10 and PM2.5 it can be seen that the values are lowest in suburban areas and highest at roads. Only O3 is higher in the suburbs than near roads. Urban background stations are on average between the other two station types for all pollutants. The CAQI index values seem to be similar at each station type (slightly higher at the suburban station), which makes sense and is in line with the aim of the index, which is to make different stations comparable.
Distinguishing between weekdays and weekends reveals clear differences in the mean pollutant concentrations for all station types. For all pollutants except O3, higher values are found on weekdays. O3, on the other hand, shows higher values on weekends. The station types differ in that the gap between weekdays and weekend days is significantly larger along roads and in the city than in suburban areas. For suburban and background stations, average CAQI index values appear to be very similar throughout the week. At traffic stations, significantly higher average index values can be observed on weekdays than on weekends.
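A comparison of this kind can be sketched with a grouped summary. The data below are made up for illustration; only the column names `date`, `stationgroups` and `NO2` follow the air quality data set, and `day_type` is a hypothetical helper column.

```r
library(dplyr)

# Made-up hourly NO2 readings: Friday 2023-03-03 and Saturday 2023-03-04
toy <- data.frame(
  date = rep(as.POSIXct(c("2023-03-03 08:00", "2023-03-03 09:00",
                          "2023-03-04 08:00", "2023-03-04 09:00"),
                        tz = "UTC"), times = 2),
  stationgroups = rep(c("traffic", "suburb"), each = 4),
  NO2 = c(40, 44, 22, 26, 12, 14, 10, 12)
)

weekday_means <- toy %>%
  mutate(
    # POSIXlt weekday: 0 = Sunday, 6 = Saturday (independent of the locale)
    day_type = if_else(as.POSIXlt(date)$wday %in% c(0, 6), "weekend", "weekday")
  ) %>%
  group_by(stationgroups, day_type) %>%
  summarise(mean_NO2 = mean(NO2, na.rm = TRUE), .groups = "drop")

print(weekday_means)
```

The same pattern applies to the other pollutants and to the CAQI index by summarising the respective column instead of `NO2`.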
When distinguishing between the seasons, clear differences can be seen, especially for NO2 and O3. While for NO2 significantly lower values are measured in spring and summer than during winter and fall, the opposite effect can be observed for O3. Here the values are generally higher in spring and summer than during winter and fall. In the light of previous observations comparing NO2 and O3, a counteracting effect may be suspected here. For PM10 and PM2.5, values tend to be somewhat lower in summer and spring than during winter and fall. For CAQI index values, one can observe significantly higher values in spring and summer than during fall and winter.
Patterns in Pollutant Concentrations Over the Course of a Day
Looking at the average pollutant concentrations over the course of the day, the temporal trends mentioned above can be examined in more detail. For NO2, especially on weekdays, there is a strong increase in concentration during the morning hours. A similar but much weaker increase can be observed in the late afternoon. As this pattern is mainly visible at stations of the “traffic” type, it is plausible that it reflects emissions from rush-hour traffic. The fact that these variations are not visible on weekends reinforces this assumption. It can also be seen that the increase of NO2 in the morning hours coincides with a strong decrease of O3 measurements. However, this decrease is not visible in the afternoon. This phenomenon will be examined in more detail below. For PM10 and PM2.5, an increase of the measured values in the morning is also observed on weekdays, after which the values decrease continuously during the day. On weekends, there is no significant change in particulate matter levels.
Relationships between Pollutants, Weather Variables, and Traffic
Trends in Essential Weather Variables
Looking at the monthly trends in essential weather variables, the temperature at 2 meters shows a clear seasonal pattern, with lower temperatures during the winter period (January to March and October to December) and significantly higher temperatures in the summertime (April to September). The CAQI index appears to be negatively correlated with temperature: it has higher values in months with low temperatures and lower values in months with higher temperatures. This suggests that air quality tends to be worse when temperatures are low, which could be due to factors such as increased emissions from heating and less dispersion of pollutants under stable weather conditions in winter. The wind speed at 100 meters appears to be higher during the winter and spring months (January to May) and lower in the summer and fall months (June to December). The CAQI index shows a somewhat opposite trend to wind speed: it is lower when the wind speed is high and vice versa. This suggests that wind can help disperse pollutants and possibly improve air quality. The amount of precipitation is relatively low in the winter months (January to March), starts to increase in spring (April to June), peaks during the summer months (July to September), and decreases during fall (October to December). The CAQI index does not show a clear relationship with precipitation. It seems to be relatively high during months with both low and high precipitation.
Traffic Flow Trends and Relations
In order to identify trends in traffic flows and correlations between traffic and air quality data, traffic data were merged with pollutant data as part of the exploratory data analysis. Traffic data were only linked to air quality stations of the category “traffic”. For each station, data from nearby traffic sensors were taken into account.
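The merge described here can be sketched as follows. This is an illustration under stated assumptions, not the code used for the report: the sensor-to-station mapping `sensor_map`, the station name "Station A", all values, and the choice to sum the vehicle counts of the assigned sensors are hypothetical. Only the column names `cs_shortname`, `date`, `q_kfz_mq_hr`, `Station` and `NO2` follow the data sets shown above.

```r
library(dplyr)

hours <- as.POSIXct(c("2023-03-03 08:00", "2023-03-03 09:00"), tz = "UTC")

# Hypothetical assignment of traffic sensors to an air quality station
sensor_map <- data.frame(
  cs_shortname = c("TE001", "TE002"),
  Station = c("Station A", "Station A")
)

# Made-up hourly vehicle counts per sensor
toy_traffic <- data.frame(
  cs_shortname = c("TE001", "TE002", "TE001", "TE002"),
  date = rep(hours, each = 2),
  q_kfz_mq_hr = c(800, 600, 900, 700)
)

# Made-up hourly pollutant readings for the station
toy_air <- data.frame(Station = "Station A", date = hours, NO2 = c(45, 50))

# Aggregate all sensors assigned to a station per hour (sum is an assumption)
traffic_by_station <- toy_traffic %>%
  inner_join(sensor_map, by = "cs_shortname") %>%
  group_by(Station, date) %>%
  summarise(q_kfz = sum(q_kfz_mq_hr, na.rm = TRUE), .groups = "drop")

# Attach the aggregated traffic flow to the air quality records
air_traffic <- toy_air %>%
  left_join(traffic_by_station, by = c("Station", "date"))

print(air_traffic)
```

A left join keeps every air quality record, so hours without traffic data remain in the result with a missing traffic value.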
Looking at the hourly trends of the CAQI index, a diurnal pattern can be seen, with lower values in the early morning and higher values later in the day. A peak can be observed in the late afternoon hours. The hourly traffic flow also shows a diurnal pattern with peaks in the morning and late afternoon, probably due to rush-hour traffic. Looking at the monthly trends in the CAQI index, there is a seasonal pattern with higher values in the colder months of the year and lower values in the warmer months. This could be due to various factors such as changes in weather conditions and human activities. The traffic flow seems to be slightly higher in summer than in winter. One can also observe a slight decrease in August, which might be caused by the summer holidays. When comparing weekdays with weekends, the CAQI index seems to be slightly lower on weekends (days 6 and 7) than on weekdays. As expected, the traffic flow is significantly lower on Saturdays and Sundays than on other days of the week. These trends suggest that traffic could be a contributing factor to air pollution levels as measured by the CAQI index, especially during weekday rush hours. However, the relationship is not straightforward and could be influenced by various other factors such as weather conditions and seasonal variations. Further analysis could look at the interaction effects between these variables.
Correlation matrices make it easy to identify various linear relationships between different variables. By combining air quality measurements, weather data, and, in the case of “traffic” stations, traffic data, linear relationships can be observed. When having a closer look, however, it quickly becomes clear that there may be differences between groups of stations and stations themselves. One station may be in the shade and protected from sunlight. Another station may be well protected from the wind. Therefore, depending on the individual conditions, values may vary from station to station. However, these are aspects that cannot be explored in depth within the scope of this work. In the following, the correlation matrices for the respective station types are examined.
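A correlation matrix of this kind can be sketched with base R. The values below are made up, and the use of pairwise complete observations is an assumption about how missing values might be handled; the report itself does not show this step.

```r
# Made-up measurements with a few missing values
toy <- data.frame(
  NO2 = c(30, 25, NA, 15, 10, 8),
  O3 = c(20, 30, 45, 50, NA, 70),
  windspeed_100m = c(2, 4, 6, 8, 10, 12)
)

# Pearson correlations; each pair of columns uses the rows where both are observed
cor_matrix <- cor(toy, use = "pairwise.complete.obs", method = "pearson")

print(round(cor_matrix, 2))
```

In this toy data NO2 falls and O3 rises with wind speed, so the two pollutants end up with opposite signs in the wind speed column.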
By examining both suburban and background stations, we find that PM10, PM2.5 and NO2 exhibit strong positive correlations, while these pollutants are negatively correlated with O3. Wind speed, in particular, seems to play a significant role in affecting pollutant concentrations, as it appears to have negative correlations with all pollutants. This could suggest that higher wind speeds help disperse pollutants and enhance air quality.
The CAQI index values are positively correlated with several variables, including the duration of daily sunlight, the dew point and the temperature, and negatively correlated with the relative humidity at two meters. This could indicate that drier, warmer and sunnier conditions are associated with worse air quality. Interestingly, all pollutants apart from O3 show inverse correlations with temperature, dew point and daily sunlight, and a positive correlation with relative humidity.
Moving to the traffic stations, the linear correlations between weather variables and pollutants are generally much weaker than at other station types. This observation could suggest that air quality at traffic stations may be influenced by several other factors not included in our data. However, wind speed still has a significant negative correlation with all pollutants as well as the CAQI index, reinforcing the idea that wind may help disperse pollutants.
Traffic data, specifically the flow of cars and trucks, shows a positive correlation with NO2 and the CAQI index. This suggests that increased traffic might be associated with worse air quality. However, these relationships are not very strong.
Creation of interaction features
By creating interaction features, one can see whether the effect of one variable on the outcome variable depends on the value of another variable. Thus, somewhat more complex relationships can be observed.
Interaction features between the pollutants (PM2.5, PM10, O3, NO2) and wind speed, temperature and relative humidity were created to examine their relationship with the CAQI index. They are created by multiplying the values of the interacting variables. Then, the correlation between each interaction feature and the CAQI index is computed. The following correlation values show that the interactions between all pollutants (especially O3) and temperature are positively correlated with the CAQI index. The interactions of the pollutants with relative humidity and wind speed are also positively correlated with the CAQI index, although only very weakly for NO2. This suggests that wind speed, temperature and relative humidity may influence the effect of these pollutants on the CAQI index. However, such relationships are complex, and capturing all factors behind these effects would require more complex algorithms and most probably additional features not contained in our data.
| Interaction Feature | Correlation with CAQI index |
|---|---|
| pm25_windspeed_100m | 0.3121248 |
| pm25_temperature_2m | 0.2924907 |
| pm25_relativehumidity_2m | 0.3570096 |
| pm10_windspeed_100m | 0.3877188 |
| pm10_temperature_2m | 0.3641073 |
| pm10_relativehumidity_2m | 0.4494474 |
| O3_windspeed_100m | 0.4842067 |
| O3_temperature_2m | 0.6830223 |
| O3_relativehumidity_2m | 0.4928495 |
| NO2_windspeed_100m | 0.0347842 |
| NO2_relativehumidity_2m | 0.0889495 |
| NO2_temperature_2m | 0.1274054 |
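How one such interaction feature and its correlation with the CAQI index can be computed is sketched below. The values are made up, so the resulting correlation is not comparable to the figures in the table; the column names follow the naming pattern used there.

```r
# Made-up observations
toy <- data.frame(
  O3 = c(20, 35, 50, 65, 80),
  temperature_2m = c(2, 8, 14, 20, 26),
  caqi_index = c(12, 18, 26, 33, 45)
)

# Interaction feature: element-wise product of the two interacting variables
toy$O3_temperature_2m <- toy$O3 * toy$temperature_2m

# Pearson correlation between the interaction feature and the CAQI index
interaction_cor <- cor(toy$O3_temperature_2m, toy$caqi_index,
                       use = "complete.obs")

print(interaction_cor)
```

Because the product grows when both variables are large at the same time, a high correlation with the index points to periods in which high ozone and high temperature coincide with poor air quality.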
To conclude, the deep dive into Berlin’s air quality proved to be an analysis revealing patterns in pollution levels, weather and traffic data. Initial challenges around missing pollutant data were tackled by eliminating stations that were significantly deficient in their observations. Other missing values were filled through careful interpolation, keeping the data integrity intact. The established CAQI-Index methodology was utilized to combine different pollutant levels into one representative figure. By breaking down these index values into understandable categories, one could gain a relatable and comparative insight into air quality in the city of Berlin. The temporal analysis showed trends that varied based on station locations, weekdays, weekends, and seasons. Traffic-heavy areas recorded more pollutants, especially NO2, as compared to urban background and suburban stations. Regular workdays reported higher pollutant levels, with the exception of O3, which rose during the weekends. Seasonally, NO2 peaked during the chillier fall and winter, while O3 levels climbed during spring and summer. Similar patterns were mirrored in the CAQI index values, underlining the influence of individual pollutants on overall air quality. The exploratory data analysis also considered the interplay between pollutants, weather conditions and traffic patterns. Elements such as wind speed, temperature, and humidity showed significant correlation with pollution levels. Wind speed particularly stood out for its negative correlation, suggesting a role of diffusing pollutants and possibly enhancing air quality. Traffic was associated with higher NO2 and CAQI index levels, indicating a link between heavier traffic and declining air quality. Crucially, we found that the impact of pollutants on the CAQI index was moderated by weather variables. 
The interaction between pollutants and factors like temperature, humidity, and wind speed showed positive correlations with the CAQI index, suggesting the interplay of these elements in shaping air quality. In essence, this exploratory data analysis paves the way for creating more complex machine learning models, with the goal of predicting air quality in Berlin.
Session info
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggmap_3.0.1 furrr_0.3.1 future_1.28.0 corrr_0.4.4
[5] zoo_1.8-11 knitr_1.40 lubridate_1.8.0 forcats_0.5.2
[9] stringr_1.4.1 dplyr_1.1.2 purrr_0.3.5 readr_2.1.3
[13] tidyr_1.2.1 tibble_3.2.1 ggplot2_3.4.0 tidyverse_1.3.2
loaded via a namespace (and not attached):
[1] httr_1.4.4 bit64_4.0.5 vroom_1.6.0
[4] jsonlite_1.8.3 modelr_0.1.9 assertthat_0.2.1
[7] highr_0.9 sp_1.5-1 googlesheets4_1.0.1
[10] cellranger_1.1.0 yaml_2.3.6 globals_0.16.1
[13] pillar_1.9.0 backports_1.4.1 lattice_0.20-45
[16] glue_1.6.2 digest_0.6.30 RColorBrewer_1.1-3
[19] rvest_1.0.3 colorspace_2.0-3 htmltools_0.5.5
[22] plyr_1.8.7 pkgconfig_2.0.3 broom_1.0.1
[25] listenv_0.8.0 haven_2.5.1 scales_1.2.1
[28] jpeg_0.1-9 tzdb_0.3.0 googledrive_2.0.0
[31] farver_2.1.1 generics_0.1.3 ellipsis_0.3.2
[34] withr_2.5.0 cli_3.6.1 magrittr_2.0.3
[37] crayon_1.5.2 readxl_1.4.1 evaluate_0.17
[40] fs_1.6.2 fansi_1.0.3 parallelly_1.32.1
[43] xml2_1.3.3 tools_4.2.1 hms_1.1.2
[46] RgoogleMaps_1.4.5.3 gargle_1.2.1 lifecycle_1.0.3
[49] munsell_0.5.0 reprex_2.0.2 compiler_4.2.1
[52] rlang_1.1.1 grid_4.2.1 rstudioapi_0.14
[55] htmlwidgets_1.6.2 labeling_0.4.2 bitops_1.0-7
[58] rmarkdown_2.17 gtable_0.3.1 codetools_0.2-18
[61] curl_4.3.3 DBI_1.1.3 R6_2.5.1
[64] gridExtra_2.3 bit_4.0.4 fastmap_1.1.0
[67] utf8_1.2.2 stringi_1.7.8 Rcpp_1.0.9
[70] parallel_4.2.1 vctrs_0.6.3 png_0.1-7
[73] dbplyr_2.2.1 tidyselect_1.2.0 xfun_0.39